-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DresdenCodak: Fix and improve scraper #230
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #230 +/- ##
==========================================
- Coverage 82.10% 81.75% -0.35%
==========================================
Files 79 79
Lines 6649 6687 +38
Branches 525 536 +11
==========================================
+ Hits 5459 5467 +8
- Misses 1071 1101 +30
Partials 119 119 ☔ View full report in Codecov by Sentry. |
Is code coverage of scrapers a requirement or just a helpful hint for maintainers to keep an eye on things? |
The coverage is just advisory. It would be nice to add some tests for more complicated scrapers (I broke some scrapers by refactoring and only noticed month later), but not required. If you want to add tests, you can find some inspiration in https://github.com/webcomics/dosage/blob/master/tests/test_modules.py |
I'm really not sure if I want to allow modules to mess with sub-directories... (I'm not totally opposed, but also not really convinced that the feature is that useful...) I'm pretty sure some modules actually rely on Dosage stripping the path from the "target" file name, so that's one reason I'm a bit uncomfortable with this change... |
I understand, but I do really want subdirectories (and spaces in names), hence the suggestion of having a bit of a double hierarchy of objects. So we can leave everything as is but things that want to make use of the feature can use |
The path separators are necessary if we want to be able to organize comics into books or chapters or categories. The spaces are nice to have. I assume they haven't been allowed because they require escaping in a terminal but any modern terminal provides assistance for that during tab completion. Disallowing spaces here means they can never be used. Instead a method could be added to an appropriate class to sanitize filenames and replace spaces by default if that is actually desired.
If the filename is a path we need to ensure all its components exist to avoid errors later when we try to create the file at that path.
The scraper was broken due to the site layout changing. The structure is lacking so there are many unique cases to deal with. As the comic is separated into two storylines, one finished and one ongoing, and a series of one-offs that don't fit in either storyline I've put each of the series into its own subdirectory.
The scraper was broken due to the site layout changing. The structure is lacking so there are many unique cases to deal with.
As the comic is separated into two storylines, one finished and one ongoing, and a series of one-offs that don't fit in either storyline I've put each of the series into its own subdirectory.
To make this possible I've changed how the
comicdir
is guaranteed to exist, taking into account path components in thefilename
and I've added path separators in the allowed characters for filenames.I've also added space as an allowed character in filenames because I want them and there's not really a reason to disallow them. This together with allowing both
/
and\
could be a breaking change, since scrapers that generate file names containing them will no longer have them filtered out.If that's not viable then maybe we need a new hierarchy of objects that do allow filenames to specify subdirectories and then scrapers can opt into it one at a time.